Statistical inference using weights and survey design

4. Stata examples (Exercises only)

Pierre Walthéry

UK Data Service

October 2025

Survey design in Stata

  • Stata provides comprehensive support for computing survey design-informed estimates from survey data

  • Implementation logic similar to R:

    • Declare the survey design using svyset:
      svyset psu_var [pweight=weight_var], strata(strata_var)

    • Use svy:-prefixed commands for estimation:
      svy: mean myvar, svy: tab myvar, etc.
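The two steps above can be sketched on a dataset other than the course data. This is a minimal illustration only: it assumes that webuse can reach the Stata website, and the names psu, finalwgt, strata, age and sex belong to Stata's nhanes2 example dataset, not to the BSA file used later.

```stata
* Load an example survey dataset (assumes internet access for webuse)
webuse nhanes2, clear

* Step 1: declare the design: PSU identifier first, then weights, then strata
svyset psu [pweight=finalwgt], strata(strata)

* Optional check: list the strata and the number of PSUs in each
svydescribe

* Step 2: prefix estimation commands with svy:
svy: mean age
svy: tabulate sex
```

Once svyset has been run, the design settings are remembered for the session (and saved with the dataset), so they do not need to be repeated before each svy: command.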

Stata command-based weighting - 1

  • Users may add sampling weights to most Stata estimation commands, or use survey-specific commands. The latter is recommended.

  • Stata distinguishes between four kinds of (dealing with) weights:

    • frequency weights (fweight),
    • analytic weights (aweight),
    • importance weights (iweight) and
    • probability weights (pweight).
  • These mostly differ in the way standard errors are computed

Stata command-based weighting - 2

  • Survey weights should be treated as probability weights or pw.

  • Key estimation commands, such as summarize or tab, do not allow pw: this is to nudge users to rely on the svy: commands instead.

  • ‘On the fly’ weighting (i.e. not using survey design functions) in Stata consists of specifying the weighting variable between square brackets:

  • stata_command myvar [pw=weight_var]

  • It is tempting to specify the wrong kind of weights (fw or aw) instead if one does not wish to use the survey design functions. You may get the correct point estimates, but your standard errors are likely to be incorrect. Do this at your own risk.
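The behaviour described on this slide can be tried out with the short sketch below. It again assumes Stata's nhanes2 example dataset (variables age, finalwgt, psu and strata) fetched with webuse, rather than the course data.

```stata
* Example dataset (assumes internet access for webuse)
webuse nhanes2, clear

* summarize rejects probability weights; capture noisily shows the error
* message but lets a do-file carry on
capture noisily summarize age [pw=finalwgt]
display "return code: " _rc

* Analytic weights are accepted: the weighted mean is the same point
* estimate, but no design-based standard error is produced
summarize age [aw=finalwgt]

* mean accepts pw, but ignores stratification and clustering
mean age [pw=finalwgt]

* Recommended: declare the design once, then use the svy: prefix
svyset psu [pweight=finalwgt], strata(strata)
svy: mean age
```

The last two commands should return the same mean with different standard errors, which is the pattern examined in Question 3.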

Question 3

  • What would be the consequences of:

    • weighting but not accounting for the sample design;
    • neither using weights nor accounting for the survey design?

  • When:
    • inferring the mean age in the population?
    • computing the uncertainty of this estimate?

Stata version

  • Opening the dataset and declaring the survey design (scroll down for full output)

. use ~/Data/bsa/UKDA-8450-stata/bsa2017_for_ukda.dta,clear

. 
. svyset Spoint [pw=WtFactor], strata(StratID) 

Sampling weights: WtFactor
             VCE: linearized
     Single unit: missing
        Strata 1: StratID
 Sampling unit 1: Spoint
           FPC 1: <zero>

. 

  • Computing the survey design-informed version of the mean…
. svy: mean RAgeE
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 159            Number of obs   =      3,988
Number of PSUs   = 372            Population size = 3,988.0019
                                  Design df       =        213

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
       RAgeE |   48.31309   .4235811      47.47815    49.14804
--------------------------------------------------------------

  • And the other two versions:

.   mean RAgeE [pw=WtFactor]

Mean estimation                          Number of obs = 3,988

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       RAgeE |   48.31309   .3435013      47.63964    48.98655
--------------------------------------------------------------

. mean RAgeE

Mean estimation                          Number of obs = 3,988

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       RAgeE |   52.19358   .2872482      51.63041    52.75675
--------------------------------------------------------------

. 

Answer - Stata

  • Not using weights results in overestimating the mean age in the population (of those aged 18+) by about 4 years.
  • This might be due to the fact that older respondents are more likely to take part in surveys.
  • Using command-based weighting does not alter the value of the estimated population mean when compared with survey design-informed estimates…
  • … but would lead us to overestimate the precision (that is, underestimate the uncertainty) of our estimate: the 95% confidence interval is roughly 2 months narrower at each end, or about 4 months overall.
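The loss of precision can also be quantified directly. The sketch below assumes the BSA 2017 data are loaded and svyset as on the earlier slide; the two standard errors in the display lines are copied from the output shown above.

```stata
* Design-based mean, as before
svy: mean RAgeE

* Design effect (DEFF) and its square root (DEFT), both relative to
* a simple random sample of the same size
estat effects

* Back-of-envelope comparison with the weights-only standard error
display "Ratio of SEs (design-based / weights only): " %5.2f .4235811/.3435013
display "Ratio of variances: " %5.2f (.4235811/.3435013)^2
```

The ratio of standard errors is about 1.23. Note that estat effects uses simple random sampling as its baseline, so its DEFT need not equal this ratio, which takes the weights-only estimate as the baseline instead.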

Proportions and their 95% CI


. ** Creating a dummy variable for significant interest in politics
. quietly recode Politics 1 2 =1 3/8=0,gen(Politics2) 

. ** Survey-design informed frequencies...
. svy:ta Politics2  
(running tabulate on estimation sample)

Number of strata = 159                            Number of obs   =      3,988
Number of PSUs   = 372                            Population size = 3,988.0019
                                                  Design df       =        213

----------------------
RECODE of |
Politics  |
(How much |
interest  |
do you    |
have in   |
politics? |
)         | proportion
----------+-----------
        0 |      .5699
        1 |      .4301
          | 
    Total |          1
----------------------
Key: proportion = Cell proportion

. 

  • … Proportions and CIs
. svy:ta Politics2, percent ci 
(running tabulate on estimation sample)

Number of strata = 159                            Number of obs   =      3,988
Number of PSUs   = 372                            Population size = 3,988.0019
                                                  Design df       =        213

----------------------------------------------
RECODE of |
Politics  |
(How much |
interest  |
do you    |
have in   |
politics? |
)         | percentage          lb          ub
----------+-----------------------------------
        0 |      56.99       55.06        58.9
        1 |      43.01        41.1       44.94
          | 
    Total |        100                        
----------------------------------------------
Key: percentage = Cell percentage
             lb = Lower 95% confidence bound for cell percentage
             ub = Upper 95% confidence bound for cell percentage

Question 4

  • What is the proportion of respondents aged 17-34 in the sample, as well as its 95% confidence interval?

    • You can use RAgecat5

Answer

  • Same for age categories
. svy:ta RAgecat5, percent ci                         
(running tabulate on estimation sample)

Number of strata = 159                            Number of obs   =      3,988
Number of PSUs   = 372                            Population size = 3,988.0019
                                                  Design df       =        213

----------------------------------------------
Age of    |
responden |
t(grouped |
)         |
<3-catego |
ry> dv    | percentage          lb          ub
----------+-----------------------------------
    17-34 |      28.46       26.46       30.56
    35-54 |      33.98        32.3       35.69
      55+ |       37.5       35.67       39.36
   DK/Ref |      .0601       .0214       .1683
          | 
    Total |        100                        
----------------------------------------------
Key: percentage = Cell percentage
             lb = Lower 95% confidence bound for cell percentage
             ub = Upper 95% confidence bound for cell percentage

Question 5

  • What is the 95% confidence interval for the proportion of people significantly interested in politics in the North East?
  • Is the proportion likely to be different in London? In what way?
  • What is the region of the UK for which the estimates are likely to be least precise?

Not accounting for domain estimation

. svy:prop Politics2 if GOR_ID==1, percent  cformat(%9.1f)
(running proportion on estimation sample)

Survey: Percent estimation

Number of strata =  8             Number of obs   =        180
Number of PSUs   = 16             Population size = 167.274844
                                  Design df       =          8

--------------------------------------------------------------
             |             Linearized            Logit
             |    Percent   std. err.     [95% conf. interval]
-------------+------------------------------------------------
   Politics2 |
          0  |       66.6          .             .           .
          1  |       33.4          .             .           .
--------------------------------------------------------------
Note: Missing standard errors because of stratum with single
      sampling unit.

… And accounting for it

  • % interested in politics in the North East…
. svy,subpop(if GOR_ID==1):prop Politics2, percent  cformat(%9.1f)
(running proportion on estimation sample)

Survey: Percent estimation

Number of strata =  8             Number of obs   =        214
Number of PSUs   = 19             Population size = 199.035253
                                  Subpop. no. obs =        180
                                  Subpop. size    = 167.274844
                                  Design df       =         11

--------------------------------------------------------------
             |             Linearized            Logit
             |    Percent   std. err.     [95% conf. interval]
-------------+------------------------------------------------
   Politics2 |
          0  |       66.6        3.1          59.4        73.1
          1  |       33.4        3.1          26.9        40.6
--------------------------------------------------------------
Note: 151 strata omitted because they contain no subpopulation
      members.

  • … And in London
. svy,subpop(if GOR_ID==7):prop Politics2, percent  cformat(%9.1f)
(running proportion on estimation sample)

Survey: Percent estimation

Number of strata = 21             Number of obs   =        409
Number of PSUs   = 45             Population size = 538.816569
                                  Subpop. no. obs =        409
                                  Subpop. size    = 538.816569
                                  Design df       =         24

--------------------------------------------------------------
             |             Linearized            Logit
             |    Percent   std. err.     [95% conf. interval]
-------------+------------------------------------------------
   Politics2 |
          0  |       45.8        3.4          38.9        52.8
          1  |       54.2        3.4          47.2        61.1
--------------------------------------------------------------
Note: 138 strata omitted because they contain no subpopulation
      members.
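To compare precision across all regions without running one command per region, the over() option can be used, as in the sketch below. It assumes the BSA 2017 data are loaded and svyset, and that Politics2 has been created as above.

```stata
* Percentages for every region at once; over() treats each region
* as a subpopulation, so the full design is kept for variance estimation
svy: proportion Politics2, over(GOR_ID) percent cformat(%9.1f)
```

The region with the largest standard error (or the widest confidence interval) is the one whose estimate is least precise.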

Question 6

Using interest in politics as before, and the three-category age variable RAgecat5:

  • Produce a table showing the proportion of respondents significantly interested in politics by age group and gender
  • Assess whether the age difference in interest in politics is similar for each gender.
  • Is it fair to say that men aged under 35 are more likely to declare being interested in politics than women aged 55 and above?

Q6 - Answer


. *  Men under 35
. svy,subpop(if RAgecat5==1 & Rsex==1):prop Politics2 ,percent  cformat(%9.1f) 
(running proportion on estimation sample)

Survey: Percent estimation

Number of strata = 138            Number of obs   =      3,591
Number of PSUs   = 326            Population size = 3,608.3226
                                  Subpop. no. obs =        339
                                  Subpop. size    = 573.962978
                                  Design df       =        188

--------------------------------------------------------------
             |             Linearized            Logit
             |    Percent   std. err.     [95% conf. interval]
-------------+------------------------------------------------
   Politics2 |
          0  |       57.3        2.6          52.0        62.4
          1  |       42.7        2.6          37.6        48.0
--------------------------------------------------------------
Note: 21 strata omitted because they contain no subpopulation
      members.

. * Women 55+ 
.   svy,subpop(if RAgecat5==3 & Rsex==2):prop Politics2, percent  cformat(%9.1f
> ) 
(running proportion on estimation sample)

Survey: Percent estimation

Number of strata = 155            Number of obs   =      3,916
Number of PSUs   = 364            Population size = 3,908.1585
                                  Subpop. no. obs =        928
                                  Subpop. size    = 796.172719
                                  Design df       =        209

--------------------------------------------------------------
             |             Linearized            Logit
             |    Percent   std. err.     [95% conf. interval]
-------------+------------------------------------------------
   Politics2 |
          0  |       57.0        1.8          53.5        60.4
          1  |       43.0        1.8          39.6        46.5
--------------------------------------------------------------
Note: 4 strata omitted because they contain no subpopulation
      members.

. 
  • Contrast with:
svy: tab  Politics2 if RAgecat5==1 & Rsex==1, percent ci
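
One possible way of producing the full table and of comparing the two groups more formally is sketched below. It assumes the BSA 2017 data are loaded and svyset, and that Politics2 has been created as above; grp is a hypothetical helper variable introduced only for this illustration.

```stata
* Table of percentages by age group and gender in one command
svy: proportion Politics2, over(RAgecat5 Rsex) percent cformat(%9.1f)

* Hypothetical helper variable: 1 = men under 35, 2 = women aged 55+
generate byte grp = 1 if RAgecat5==1 & Rsex==1
replace grp = 2 if RAgecat5==3 & Rsex==2

* Linear regression on the group indicator within the two-group
* subpopulation: the coefficient on 2.grp is the difference in
* proportions (women 55+ minus men under 35), with a design-based
* standard error and confidence interval
svy, subpop(if grp < .): regress Politics2 i.grp
```

If the confidence interval for the 2.grp coefficient includes zero, the data do not support the claim that one group is more likely than the other to declare an interest in politics. This is consistent with the two estimates above (42.7% and 43.0%), whose confidence intervals overlap almost entirely.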